Bimodal Content Defined Chunking for Backup Streams

Authors

  • Erik Kruus
  • Cristian Ungureanu
  • Cezary Dubnicki
Abstract

Data deduplication has become a popular technology for reducing the amount of storage space necessary for backup and archival data. Content defined chunking (CDC) techniques are well-established methods of separating a data stream into variable-size chunks such that duplicate content has a good chance of being discovered irrespective of its position in the data stream. Requirements for CDC include fast and scalable operation, as well as good duplicate elimination. While the latter can be achieved by using chunks of small average size, this also increases the amount of metadata necessary to store the relatively more numerous chunks, and negatively impacts the system's performance. We propose a new approach that achieves comparable duplicate elimination while using chunks of larger average size. It involves two chunk size targets, and mechanisms that dynamically switch between them based on querying data already stored; we use small chunks in limited regions of transition from duplicate to nonduplicate data, and large chunks elsewhere. The algorithms rely on the block store's ability to quickly deliver a high-quality reply to existence queries for already-stored blocks. A chunking decision is made with limited lookahead and a limited number of queries. We present results of running these algorithms on actual backup data, as well as on four sets of source code archives. Our algorithms typically achieve duplicate elimination similar to that of standard algorithms while using chunks 2–4 times as large. Such approaches may be particularly interesting to distributed storage systems that use redundancy techniques (such as error-correcting codes) requiring multiple chunk fragments, for which metadata overheads per stored chunk are high. We find that algorithm variants with more flexibility in the location and size of chunks yield better duplicate elimination, at the cost of a higher number of existence queries.
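
To make the bimodal scheme concrete, here is a minimal Python sketch of the switching logic. It is not the authors' implementation: `chunk_exists` stands in for the block store's existence query, `cut_points` is a deliberately naive content-defined chunker, and the window size and the 4 KiB/32 KiB chunk targets are illustrative assumptions.

```python
import hashlib

WINDOW = 48            # bytes hashed at each position (assumed)
SMALL_AVG = 4 << 10    # ~4 KiB small-chunk target (assumed)
LARGE_AVG = 32 << 10   # ~32 KiB large-chunk target (assumed)

def cut_points(data: bytes, avg: int):
    """Content-defined cuts: cut wherever the window hash % avg == 0,
    giving chunks of roughly `avg` bytes on average."""
    cuts = []
    for i in range(WINDOW, len(data)):
        h = int.from_bytes(hashlib.md5(data[i - WINDOW:i]).digest()[:4], "big")
        if h % avg == 0:
            cuts.append(i)
    cuts.append(len(data))
    return cuts

def bimodal_chunks(data: bytes, chunk_exists):
    """Chunk at the large target; where a large chunk is new (the existence
    query fails) but borders duplicate data, re-chunk it at the small target
    so the duplicate/nonduplicate transition is captured finely."""
    spans, prev = [], 0
    for c in cut_points(data, LARGE_AVG):
        spans.append((prev, c))
        prev = c
    dup = [chunk_exists(data[s:e]) for s, e in spans]  # existence queries
    chunks = []
    for k, (s, e) in enumerate(spans):
        near_dup = (k > 0 and dup[k - 1]) or (k + 1 < len(spans) and dup[k + 1])
        if not dup[k] and near_dup:
            p = 0                       # transition region: emit small chunks
            for c in cut_points(data[s:e], SMALL_AVG):
                chunks.append(data[s + p:s + c])
                p = c
        else:
            chunks.append(data[s:e])    # duplicate, or deep inside new data
    return chunks
```

With a real block store, `chunk_exists` would be its existence query; for experimentation, any predicate over chunk bytes works, e.g. membership in a set of hashes recorded during a previous backup run.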

Similar Articles

An Efficient Data Deduplication based on Tar-format Awareness in Backup Applications

Disk-based backup storage systems are widely used, and data deduplication is becoming an essential technique in such systems because of its space efficiency. Usually, a user's files are aggregated into a single Tar file at primary storage, and the Tar file is transferred to and stored in the backup storage system periodically (e.g., as a weekly full backup) [1]. In this paper, we ...

A Relational Approach to Querying Streams

Data streams are long, relatively unstructured sequences of characters that contain information such as electronic mail or a tape backup of various documents and reports created in an office. This paper deals with a conceptual framework, using relational algebra and relational databases, within which data streams may be queried. As information is extracted from the data stream, it is put into a...

Leap-based Content Defined Chunking - Theory and Implementation

Content Defined Chunking (CDC) is an important component of data deduplication, affecting both the deduplication ratio and deduplication performance. The sliding-window-based CDC algorithm and its variants have been the most popular CDC algorithms for the last 15 years. However, their performance is limited in certain application scenarios since they have to slide byte by byte. The a...
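
The byte-by-byte sliding referred to in this excerpt can be illustrated with a small Python sketch of a rolling-hash chunker. A Rabin-Karp-style polynomial hash stands in for the Rabin fingerprints typically used, and the window width, modulus, and 13-bit boundary mask are assumptions chosen to give roughly 8 KiB average chunks.

```python
WINDOW = 48                  # sliding-window width in bytes (assumed)
BASE, MOD = 257, (1 << 61) - 1
MASK = (1 << 13) - 1         # low 13 bits all set -> ~8 KiB average (assumed)

def sliding_window_cuts(data: bytes):
    """Declare a cut wherever the rolling hash of the last WINDOW bytes
    satisfies the boundary condition; the hash is updated once per byte."""
    pow_w = pow(BASE, WINDOW, MOD)  # BASE**WINDOW, for expiring old bytes
    cuts, h = [], 0
    for i, b in enumerate(data):
        h = (h * BASE + b) % MOD                      # shift in the new byte
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * pow_w) % MOD  # shift out the old byte
        if i + 1 >= WINDOW and (h & MASK) == MASK:    # boundary test
            cuts.append(i + 1)
    cuts.append(len(data))
    return cuts
```

Because the hash must be updated at every byte position, the chunker touches each input byte at least once; this per-byte cost is what leap-based approaches aim to reduce.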

The assignment of chunk size according to the target data characteristics in deduplication backup system

This paper focuses on the trade-off between the deduplication rate and the processing penalty in a backup system that uses a conventional variable-size chunking method. The trade-off is a nonlinear negative correlation if the chunk size is fixed. To analyze the trade-off quantitatively across all of the factors, a simulation approach is taken, which clarifies several correlations among chunk siz...

Survey of Research on Chunking Techniques

The explosive growth of data produced by different devices and applications has contributed to the abundance of big data. To process such amounts of data efficiently, strategies such as de-duplication have been employed. Among the three levels of de-duplication, namely file level, block level, and chunk level, de-duplication at chunk level (also known as byte level) is the most popular a...

Journal:

Volume:   Issue:

Pages:  -

Publication date: 2010